This article supplement, intended as a pedagogical tool, provides all of the code necessary to reproduce the case study illustration in Computational Methods for Qualitative Research in Criminology and Criminal Justice Studies (work in progress). To boost the pedagogical value of this resource, we have provided detailed explanations and commentaries throughout each step.
The first major step in conducting any web scrape is page exploration/inspection. At this stage, the researcher explores the content and structure of the pages of interest. The first goal is to identify the various page elements one wishes to collect. In this case, we are interested in collecting specific data points from thousands of RCMP news releases, including the title of each news release, its publication date, the location of the RCMP detachment, and the main text of the release.
The second goal is to come up with an algorithmic solution or strategy for collecting this information. A key part of this second step is to carefully examine the source code of the website in order to determine what tools or libraries will be necessary to execute the scrape. While simpler websites can be scraped using an R package like library(rvest), more sophisticated websites may require that the researcher use an additional set of tools such as library(RSelenium).
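One quick way to decide between the two is to check whether the element you want already exists in the page’s static HTML; if it is injected by JavaScript after the page loads, read_html() will not see it. The sketch below illustrates this check using the news-list selector we rely on later in this tutorial; the exact message text is just illustrative.

```r
library(rvest)

# If the element is present in the raw HTML returned by read_html(),
# library(rvest) will usually suffice; if not, the content is likely
# rendered by JavaScript and library(RSelenium) may be needed.
page <- read_html('https://www.rcmp-grc.gc.ca/en/news')
news_list <- page %>% html_node('.list-group')
if (inherits(news_list, 'xml_missing')) {
  message('Element not in static HTML - a browser-automation tool may be needed')
} else {
  message('Element found in static HTML - library(rvest) should be enough')
}
```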
Another key part of constructing an algorithmic solution is to determine whether the information can (or should) be collected in one or multiple stages (where each stage represents a different script). We typically conduct our web scrapes in two stages: the index scrape and the contents scrape.
The first step we took in our index scrape was to create a file (a csv file) on our local computer to store the results of our scrape. This is not the only way to do this, but it is the approach we find most efficient. An alternative would be to store the results in RStudio’s global environment, saving them to your local computer after the scrape completes. Two major downsides to that approach are (1) you cannot see the results of the scrape until it is completed; and (2) if your scrape fails at some point (which it very likely will, especially on longer scrapes), you’ll lose the results you had obtained up to that point. So, let’s start by creating a csv spreadsheet that contains named columns for the data we’ll be collecting (headline_url, headline_text, etc.) in our index scrape. To do this we’ll use library(tibble), library(readr), and library(tidyr).
# load the libraries we'll need
library(tibble)
library(readr)
library(tidyr)

# give the file you'll be creating a name
filename <- "rcmp-news-index-scrape.csv"
# using the tibble function, create a dataframe with column headers
create_data <- function(
  headline_url = NA,
  headline_text = NA,
  date_published = NA,
  metadata_text = NA,
  page_url = NA
) {
  tibble(
    headline_url = headline_url,
    headline_text = headline_text,
    date_published = date_published,
    metadata_text = metadata_text,
    page_url = page_url
  )
}
# drop_na() removes the all-NA placeholder row, so only the headers are written
write_csv(create_data() %>% drop_na(), filename, append = TRUE, col_names = TRUE)
Next, let’s write the code for our index scraping algorithm, which will obtain the data from the RCMP’s website and populate the csv file we just created in the last chunk of code. We’ll need an additional library to do this – library(rvest) – which will be used to retrieve and parse the data from the RCMP’s website. To locate the information we want in the HTML, we’ll specify the html element that embeds the content we are interested in obtaining (link to full article, date, headline text, etc.). Identifying these elements is a bit of an art. The developer tools built into every major browser make it easier; another popular option is the SelectorGadget browser plug-in.
library(rvest)

base_url <- 'https://www.rcmp-grc.gc.ca/en/news?page='

scrape_page <- function(page_num = 0) { # note that these pages are zero-indexed
  # grab html only once
  page_url <- paste(base_url, page_num, sep = '')
  curr_page <- read_html(page_url)
  # zero in on news list
  news_list <- curr_page %>%
    html_node('.list-group')
  # grab headline nodes
  headline_nodes <- news_list %>%
    html_nodes('div > div > a')
  # use headline nodes to get urls
  headline_url <- headline_nodes %>%
    html_attr('href') %>%
    url_absolute('https://www.rcmp-grc.gc.ca/en/news')
  # use headline nodes to get text
  headline_text <- headline_nodes %>%
    html_text(trim = TRUE)
  # grab metadata field
  metadata <- news_list %>%
    html_nodes('div > div > span.text-muted')
  # use metadata field to grab pubdate
  date_published <- metadata %>%
    html_nodes('meta[itemprop=datePublished]') %>%
    html_attr('content')
  # use metadata field to grab metadata text
  metadata_text <- metadata %>%
    html_text(trim = TRUE)
  # build a tibble
  page_data <- create_data(
    headline_url = headline_url,
    headline_text = headline_text,
    date_published = date_published,
    metadata_text = metadata_text,
    page_url = page_url
  )
  # write to csv
  write_csv(page_data, filename, append = TRUE)
  # read the last page number from the pagination widget, subtracting 1
  # because the page parameter is zero-indexed
  max_page_num <- (curr_page %>%
    html_node('div.contextual-links-region ul.pagination li:nth-last-child(2)') %>%
    html_text(trim = TRUE) %>%
    as.numeric()) - 1
  # be polite: pause between requests
  Sys.sleep(3)
  # recur
  if ((page_num + 1) <= max_page_num) {
    scrape_page(page_num = page_num + 1)
  }
}
# run it once
scrape_page()
Let’s inspect the result. To do this we’ll use the paged_table function from library(rmarkdown).
library(rmarkdown)

index <- read_csv("rcmp-news-index-scrape.csv")
paged_table(index)
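Before moving on, a quick sanity check on the index can save time later. This is an optional sketch; distinct() and the pipe here come from library(dplyr), which we have not otherwise loaded.

```r
library(dplyr)

# drop duplicate rows that can appear if the index scrape was re-run,
# and confirm no urls are missing before we try to visit them
index <- index %>% distinct(headline_url, .keep_all = TRUE)
sum(is.na(index$headline_url)) # should be 0
```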
Using the results of the index scrape, we can conduct our contents scrape. This will involve visiting each url in our index (headline_url) and grabbing the content we want from each page. We’re going to grab three things: the headline url (in order to merge the results of our index and contents scrapes), the full text of the article (full_text), and the link to any image contained in the news release (image_url), if there is one. As we did with the index scrape, we’ll start by creating a csv file on our local hard drive with named columns that correspond to the information we’re going to collect.
filename <- 'rcmp-news-contents-scrape.csv'

create_data <- function(
  headline_url = NA,
  full_text = NA,
  image_url = NA
) {
  tibble(
    headline_url = headline_url,
    full_text = full_text,
    image_url = image_url
  )
}
# write once to create headers
write_csv(create_data() %>% drop_na(), filename, append = TRUE, col_names = TRUE)
And now we can write the code for our scrape. To mix it up, we’ll use the lapply() function this time.
index_list <- as.list(index$headline_url)

lapply(index_list, function(i) {
  webpage <- read_html(i)
  full_text <- html_node(webpage, ".node-news-release > div") %>%
    html_text(trim = TRUE)
  # html_node() returns a missing node when no image exists, so html_attr()
  # yields NA rather than throwing an error
  image_url <- html_node(webpage, ".img-responsive") %>%
    html_attr("src")
  if (!is.na(image_url)) {
    image_url <- image_url %>% url_absolute(i)
  }
  page_data <- create_data(
    headline_url = i,
    full_text = full_text,
    image_url = image_url
  )
  write_csv(page_data, filename, append = TRUE)
  Sys.sleep(3)
})
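Because results are written to the csv as we go, a failed contents scrape can be resumed rather than restarted. A minimal sketch, assuming filename and index are still in your environment: rebuild the to-do list from what has already been saved, then rerun the lapply() call above over todo_list instead of index_list.

```r
# urls already written to the contents csv (if any)
done_urls <- if (file.exists(filename)) {
  read_csv(filename)$headline_url
} else {
  character(0)
}
# urls still left to visit
todo_list <- as.list(setdiff(index$headline_url, done_urls))
```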
Finally, let’s combine the results of our index and contents scrapes into a single dataframe. We’ll save the combined csv file in our working directory.
# load dplyr for the join
library(dplyr)

# read in the two files
index_scrape <- read_csv("rcmp-news-index-scrape.csv")
contents_scrape <- read_csv("rcmp-news-contents-scrape.csv")
# combine the files using the headline_url column
combined_df <- index_scrape %>% left_join(contents_scrape, by = "headline_url")
# save results
write_csv(combined_df, "rcmp-news-df.csv")
#rm(list=ls()) # you may want to clear your global environment at this point
rcmp_news <- read_csv("rcmp-news-df.csv")
Let’s take a look at the results of our dataframe so far. We’ll look at the first two rows.
paged_table(rcmp_news, options = list(rows.print = 2))
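As a final check, we can summarize how complete the merged dataframe is. This is a sketch using library(dplyr); it counts the releases that came back without body text and those that include an image.

```r
library(dplyr)

# how complete is the merged dataframe?
rcmp_news %>%
  summarise(
    n_releases   = n(),
    missing_text = sum(is.na(full_text)),
    with_image   = sum(!is.na(image_url))
  )
```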